Predicting non-response to multimodal day clinic treatment in severely impaired depressed patients: a machine learning approach

A considerable number of depressed patients do not respond to treatment. Accurate prediction of non-response to routine clinical care may help in treatment planning and improve results. A longitudinal sample of N = 239 depressed patients was assessed at admission to multi-modal day clinic treatment, after six weeks, and at discharge. First, patient’s treatment response was modelled by identifying longitudinal trajectories using the Hamilton Depression Rating Scale (HDRS-17). Then, individual items of the HDRS-17 at admission as well as individual patient characteristics were entered as predictors of response/non-response trajectories into the binary classification model (eXtremeGradient Boosting; XGBoost). The model was evaluated on a hold-out set and explained in human-interpretable form by SHapley Additive explanation (SHAP) values. The prediction model yielded a multi-class AUC = 0.80 in the hold-out set. The predictive power for the binary classification yielded an AUC = 0.83 (sensitivity = .80, specificity = .77). Most relevant predictors for non-response were insomnia symptoms, younger age, anxiety symptoms, depressed mood, being unemployed, suicidal ideation and somatic symptoms of depressive disorder. Non-responders to routine treatment for depression can be identified and screened for potential next-generation treatments. Such predictors may help personalize treatment and improve treatment response.

www.nature.com/scientificreports/ small sample sizes 16 , or retrospective information 17,19 . Real world samples are needed to build models that are relevant to clinic based samples (e.g., 20,22,23 ). When applied to clinical features and other data available at baseline to develop prediction models of subsequent depression course and treatment response, machine learning (ML) is promising [24][25][26] . These methods identify patterns of information to predict outcomes at the individual patient level 27 and can detect complex high-dimensional interactions, for instance in major depressive disorder (MDD) [28][29][30] .
Here, we focus on a group of depressed patients commonly presenting to psychiatric care, attaining multimodal depression treatment in a day clinic of a large urban psychiatric hospital in Switzerland. Patients were severely ill, with significant impairment in psychosocial functioning, mostly long pre-treatment absences at workplaces, high levels of suicidal ideation and drop-out rates 31 . Day clinic treatments for acute care are comparable to inpatient programs in terms of the intensity of the delivered multimodal treatment programs, with the difference that patients return home in the evenings and over the weekends 32 and have become increasingly relevant in the context of reducing costs of inpatient care 33 and particular advantages of the setting, e.g., intensive transfer into real life.
This longitudinal observational study in the context of a multimodal treatment aimed to (i) identify latent subgroups of treatment response trajectories amongst depressed day clinic patients, using Latent Growth Mixture Models (LGMM) 34 with standard clinical data that is readily available in most clinic settings. Given that such subgroups can be identified, our second aim was to (ii) detect early predictors for treatment outcome. Identification of such predictors could help optimize treatment planning and improve treatment outcomes.

Methods
Participants. We analysed data from 362 patients treated between July 2007 to April 2019. We consecutively included all patients with clinical depression (leading to exclusion of 42 cases with an initial HDRS-17 total score < 8; defined as no depression 35 ), with at least one completed HDRS-17 assessment from start of treatment until discharge (leading to exclusion of 31 cases) and treatment completion within a period of six weeks to 12 months (leading to exclusion of 50 cases). Patients with no HDRS-17 assessment did not differ from patients with assessment that were included in our analysis regarding age (t(359) = − 0.829, p = 0.408) and sex (Chi2(1) = 0.179, p = 0.673). This resulted in the final sample of 239 patients.
Treatment. The day clinic comprised a multimodal treatment program for depression delivered by a multiprofessional team consisting of consultant psychiatrists and resident psychiatrists, clinical psychologists, specialist nurses and occupational therapists. Treatment involved tailored individual and group psychotherapy of two sessions per week each. Individual psychotherapy included evidence-based psychodynamic and cognitivebehavioural psychotherapy provided by clinically experienced psychiatrists/residents in psychiatry and psychologists. Additional and supportive therapies, such as work-/occupational-and music therapy complement the treatment program. In line with current treatment guidelines, combination treatment, comprising a psychotherapeutic and psychopharmacological approach is common. The "dose" of all interventions is about 23-24 h of interventions per week (see figure S2 in the Supplementary material). Patients attend the day clinic five days per week. Treatment duration or discharge is decided by the responsible senior psychiatrist and the individual therapist together with the patient. Treatment duration is thus determined individually and can vary between patients.
Measures. As part of the regular clinical care, patients report demographic details, including employment and civil status as well as level of education. Patients were screened by clinically experienced physicians and psychologists for depression and anxiety symptoms as part of the routine clinical documentation at admission, after six weeks of treatment and at discharge. Depression and anxiety symptoms were assessed using the clinicianadministered Hamilton Depression Rating Scale (HDRS-17 36 ) and the Hamilton Anxiety Rating Scale (HARS 37 ). The HDRS-17 and HARS have extensively been used in research on depression 38 and anxiety 39 . The reliability of the total depression symptoms (Cronbach's α = 0.87) and the total anxiety symptoms (Cronbach's α = 0.84) derived in our investigation were comparable to other studies 38,39 . Procedure. Assessments were conducted in a day clinic of an urban university teaching hospital offering treatment for patients diagnosed with depression. Treatment of depressive symptoms is the core focus, but patients' diagnoses are not limited to major depressive disorders and comorbidity was frequent (see Table 1). Patients were screened for depression and anxiety symptoms as part of the routine clinical documentation at three time points, i.e., at admission, after six weeks of treatment and at discharge. All patients attended a clinical interview as part of the admission process. This included provision of ICD-10/DSM-IV diagnoses by trained raters supervised by a senior psychiatrist. Data were collected as part of the routine clinical care procedure and completely anonymized. In accord with local cantonal ethics guidelines and the Swiss Human Research act (HRA 40 ), no specific written informed consent was thus obtained.
The primary outcome of the predictive model was the grouping into non-responding patients vs. responding patients identified by LGMM. For model development of the LGMM, the outcome measure was the HDRS-17 scale, collected at admission, after six weeks of treatment and at discharge. Candidate predictor variables for the machine learning model were depression (HDRS-17) and anxiety (HARS; as mood and anxiety disorders are frequently comorbid and share symptoms 41 and anxiety has been shown to distinguish within depression severity 42 ) items at admission, as well as basic demographic variables, i.e., sex, age, civil status, level of education (none, elementary school, completed apprenticeship, secondary school, high school, university degree, other) employment before admission (fully employed, part-time employed, currently unemployed, unemployed).    44 . This is an iterative imputation method based on random forests and can handle mixed data type containing both categorical and numerical variables 44 . It handles high-dimensional data containing complex non-linear relations as well as unequal variable scales and provides built-in out-of-bag estimates of the imputation error rate, which has been shown to be accurate for missing values ratio of up to 30% 44 (please also refer to Figure S1 in the Supplementary material for a comparison of LGMM with imputed and non-imputed data). Missing values were ≤ 23% for the HDRS-17 total values during and at the end of treatment (all patients had admission HDRS-17 scores).
LGMM identified heterogeneous trajectories based on depression symptoms at admission, after six weeks and at discharge. Individuals were assigned to trajectories based on their most likely class membership. For identifying the best-fitting model we followed recommendations from the literature 46 . We examined the Bayesian (BIC), sample size-adjusted Bayesian (SSBIC), and Akaike (AIC) information criterion indices, entropy values, the Lo-Mendell-Rubin likelihood ratio test (LMRT), and the bootstrap likelihood ratio test (BLRT) (see Table S1 in the Supplementary materials). Our aim was to find the best-fitting model with lower values for the criterion indices, higher entropy values, and significant p-values for both the LMRT and the BLRT. Our selection of the final model was determined by these indices, overall model fit, but also interpretability 47 .
Predictive modelling. In the first model, we predicted trajectories of depressive symptom course as outcome of a multinomial classifier (eXtremeGradient Boosting: XGBoost) 48 . In a second model, we built a binary classification model (XGBoost) to predict the two groups: "responding" (Responding from moderate severity) vs. "non-responding" trajectory (as the Responding from severe depression class was small and may be clinically different in terms of severity, often referred to as "very severe depression" 35,49 ). XGBoost applies gradient descent optimization to minimize training error and is a tree-based ensemble method based on decision trees. It is a well-established and widely used machine learning approach due its great performance and high computational efficiency 48,50,51 . For data pre-processing, all numerical variables were normalized to range [0;1] and variables with near-zero-variance were removed using the built-in pre-process function from the caret R package 52 . Missing values were imputed using the missForest_1.4 package in R 44 . Missing values for the included variables in our sample were low (2%). To prevent "leakage" of information about the variable distribution in training and test set, we performed the pre-processing separately for the training and test set using caret. To maximize the likelihood of unbiased results, rigorous guards against over-fitting were implemented. First, the total data were randomly split into a 70% partition as training set and a 30% hold-out set to evaluate the predictive power of the final model in completely unseen new cases. To balance the dependent variable across data partitions, stratified random sampling was applied. During model training, 10 times repeated fivefold cross-validation was applied. For multi-class AUC we used the 'pROC' R package [53][54][55] . All analyses have been performed in R 3.5.3 using RStudio 1.2.1335.
Predictor importance ranking. Variables included in the final models were ranked with respect to their predictive power for the "responding" vs. the "non-responding" symptom trajectory memberships across the three assessments. We report methods for Explainable Machine Learning using SHAP (SHapley Additive exPlanation) values to examine and critically appraise on which features the model mainly relies to arrive at individual prediction outcomes. SHAP values were used to rank variables with respect to their ability to predict the "responding" vs. the "non-responding" trajectory 56 . This is an additive feature attribution method using kernel functions that enables consistent and locally faithful explanation of feature importance 56-58 . Ethics statement. Our study was conducted in accordance with the World Medical Association Declaration of Helsinki. Ethical approval was not required or obtained, since data were collected as part of the routine clinical care procedure and completely anonymized and did thus not fall under the Human Research Act (Humanforschungsgesetz).

Results
Patients reported severe symptoms of depression, including significant dysfunction and impairment in their everyday life, i.e.: 33.9% received disability annuity, 57.8% were unable to work before treatment, 63.6% were considerably or severely ill according to ratings on a standardised seven items Likert scale of the Clinical Global Impression Scale 59 used in Swiss psychiatric hospitals as routine procedure, and more than 30% suffered from suicidal ideation. Overall, treatment was effective, but ineffective for a subgroup of patients (see below). Demographic and clinical sample characteristics are shown in Table 1.

Identification of treatment response trajectories. The information indices and likelihood tests
showed improved fit as the number of classes increased from one to four; however, this was not the case for BLRT, a more robust indicator 60 , which was not significant in the model with four classes. Also, the addition of a fourth class resulted in one very small class (two patients)-making the model less parsimonious and less interpretable. Consequently, the best fitting model was a three-class solution with a varying interval of the assessment at discharge (AIC = 4484.38, BIC = 4533.05, SSBI = 4488.67, VLMRT = 0.0063, BLRT = 0.0000) with a good entropy of 0.81 (see Supplementary Table S1). The most common symptom trajectory was a "responding" (responding from moderate severity) class, starting from a moderately depressed level to a level below clini- www.nature.com/scientificreports/ cal depression 61 (n = 174; 72.8%), followed by a "non-responding" class, starting from a moderately to severely depressed level 35,49 (mean total value of HDRS-17 at admission = 21.9; n = 47; 19.7%) and no symptom improvement over time (mean total value of HDRS-17 at discharge = 22.3). A third class was less common, "Responding from severe depression", starting from a severely depressed level (mean total value of HDRS-17 at admission = 30) to improvement (mean total value of HDRS-17 at discharge = 14.3) although on average not reaching complete remission (n = 18; 7.5%).
The unconditional LGMM is shown in Fig. 1. The "non-responding" vs. the "responding" class memberships were used as the outcome for XGBoost.

Predicting response trajectories from baseline clinical and demographic indices. The XGBoost
algorithm for predicting all three symptom trajectories yielded a multi-class AUC = 0.80 in the hold-out set. The predictive power for the binary classification (XGBoost) for discriminating the "responding" and "non-responding" trajectory was AUC = 0.83 (sensitivity = 0.80, specificity = 0.77) (see Fig. 2). Figure 3 displays the predictor variable importance rankings using SHAP values 56 . The strongest features predicting the "responding" vs. the "non-responding" symptom trajectory memberships included mainly HDRS-17 items and some demographic characteristics: Insomnia-Falling asleep, younger age, Behaviour at interview (HARS), Somatic (sensory) anxiety (HARS), Insomnia-Waking up early, Anxiety Psychic, Depressed Mood, employment status (unemployed), suicidal idea-

Discussion
We identified heterogeneity in longitudinal trajectories of depression symptoms over the course of multimodal treatment using LGMM in a day clinic. Specifically, we identified treatment response trajectories resulting in a three-class solution comprising a "non-responding" subgroup (with no symptom improvement over the course of treatment), a "Responding from severe depression" subgroup (responding from severe depression to a level of moderate depression) and a "responding" subgroup (responding from moderate depression to a level below clinical depression) 61 . All groups were severely impaired when they started treatment at the day clinic. Both "Responding from severe depression" and "responding" (Responding from moderate severity) subgroups showed overall decline in their depression symptom severity of 50% or more at discharge compared to admission. The "non-responding" class showed no change in depression symptoms. Length of stay within the non-responder group was shorter on average (Welch's F(2,39.946) = 3.473, p = 0.041), since some patients may have been transferred to an inpatient ward. Next, we applied ML to predict these trajectories. Since we were most interested in differentiating nonresponse from response, we predicted group membership for the "non-responding" compared to the "responding" group using XGBoost 48 . SHAP Values 56 identified insomnia (problems falling asleep being the most predictive item), younger age, anxiety symptoms, depressed mood, being unemployed, suicidal ideation and several somatic indicators of depressive disorders as most important predictors for non-response. The overall prediction model yielded high predictive power. These outcomes compare favourably to other studies using similar models in treatment outcome prediction 28,[62][63][64] .
The model identified sleep disturbances as one of the top predictors of poorer outcome, i.e., non-response compared to remitting depression symptoms. Patients reporting sleep problems initially were more likely to be in the non-responding group. In line with our findings, Troxel et al. 65 showed that problems falling asleep significantly increased the risk of non-remission following pharmacologic and/or psychotherapeutic treatment for depression and identified this symptom as one of the strongest predictors in their study. Basic neuroscience studies consistently link sleep to memory, learning, and, in general, to the mechanisms of neural plasticity 66 .
An increasing body of evidence shows that sleep plays a pivotal role in the orchestration of neuroplasticity 67 . Such processes are paramount for processing and benefiting from psychotherapeutic treatment, which was a fundamental pillar of treatment in the day clinic treatment investigated in this study. Sleep problems may thus have resulted in decreased plasticity and capacity to learn during psychotherapy and consequently increased the probability of non-response to treatment. Together, these results link sleep to the recovery processes and suggest a target for depression treatment. Improving sleep, for instance with cognitive-behavioural treatment approaches, may lead to increased plasticity, capacity to learn and process (emotional) memories and thus benefit from treatment 66,68 .
Suicidal ideation at admission was another indicator of non-response to depression treatment. The current data were collected as part of routine clinical care, hence including a significant proportion of patients with suicidal symptoms such as thoughts and ideations, a group often formally excluded from randomised controlled treatment trials 69 . Our model results thus include relevant information regarding this group, highlighting a subgroup of patients in need of specific attention and potentially augmented treatment regimens 70 . Whilst there are only few evidence-based treatment programs for suicidal psychiatric patients, potential options exist, including for instance, cognitive therapy interventions designed to prevent repeat suicide attempts in adults who recently attempted suicide 71 or dialectical behaviour therapy, which was also shown to be effective in reducing suicide attempts 72 .
In line with recent studies underlining the impact of social and economic risks associated with neighbourhood safety, educational attainment, housing stability on mental health outcomes 73 , demographics were also ranked as key predictors in our dataset. Interestingly, and in contrast to a previous study that predicted course trajectories of depressed outpatients 74 , younger age was associated with heightened probability of being a non-responder. Unfortunately, we do not have detailed information on the age of first depressive symptoms, but previous studies have associated the early-onset subtype of depression with worse outcomes overall [74][75][76] . Unemployment, the second most predictive demographic index, may be a risk for mental health or the result of it [77][78][79] and hamper effects of psychotherapy 80,81 . Marriage has been shown to positively affect well-being 82 , but may also be associated with negative consequences and individual, interpersonal, and structural features that may impede recovery 83 . Sex was not amongst the top 15 predictors and this is in line with previous studies 20,84 , but also see 74,85 .
There are several limitations to the study. First, exact information on number and severity of previous depressive episodes and duration of the current depressive episode was not available and their predictive value thus remains to be tested in such predictive models 20,23,86 . Second, we examined a rather small spectrum of predictors (e.g., no biological markers, e.g. 87 ). However, if replicated across other samples and treatment trajectories, the items assessed as part of this study and those selected by the models are easily integrated into routine clinical practice. Third, in LGMM, class membership assignments differ naturally in accuracy for individual patients and were accurate for some patients and less precise for others. Fourth, data were collected as a routine clinical quality assessment and provide naturalistic data on heterogeneity in treatment responses and their prediction for those patients treated at the day clinic for 6 weeks to 12 months. We replicated the prediction in the holdout sample within our dataset; a reproduction nevertheless is pending for other samples. Fifth, a structured Axis IIdiagnostics is missing. Sixth, within-treatment characteristics (for which there are no data available, e.g., change of medication or adherence to psychotherapy) may differ between patients and between trajectory classes and www.nature.com/scientificreports/ Figure 5. SHAP values for the hold-out set. This figure displays the decision rule for each feature for predicting "responding" vs. the "non-responding" symptom trajectory memberships. www.nature.com/scientificreports/ should be included in future studies. Seventh, a larger sample size would enable more comprehensive computations. Finally, response trajectories relied on measurements at three time points. More measurements would allow for more detailed assessments of trajectories (e.g., 88 ) and this may also require larger sample sizes for their identification. Future studies could expand on our findings and test whether self-report questionnaires of the same symptoms derive at the same results providing easier data collection. These data can also be used within a prediction model to help dynamically enhance psychotherapy outcomes 89 . These limitations are met by several key strengths. The naturalistic study design made use of a consecutive sample, warranting high external validity. Since treatment was comparable for all patients, internal validity was acceptable. Additionally, the longitudinal design reveals information about treatment outcome. Also, we only used clinician administered instruments, conducted by psychiatrists/residents in psychiatry or psychotherapists. Taken together, we identified heterogeneity in multimodal treatment outcome in depressed psychiatric patients. A predictive algorithm based on basic clinical and demographic data obtained in routine clinical practice identified treatment non-responders from those who responded during treatment with high predictive accuracy. These results have clinical relevance as the items selected by our algorithm could be easily obtained in clinical practice. In this heterogeneous sample of patients presenting with depression, our model was able to predict response versus non-response to multimodal treatment. Given replication of our results in other clinical settings, and possibly other groups and health care systems, such early predictors of treatment response could help pave the way towards more effective personalized therapeutic approaches and optimize treatment outcomes. www.nature.com/scientificreports/